3.7.3 Bayesian Pruning
After binarizing CNNs, we prune the resulting 1-bit CNNs under the same Bayesian learning framework. Different channels may follow similar distributions, and channels with similar distributions are combined for pruning. From a mathematical perspective, we obtain a Bayesian formulation of BNN pruning by directly extending our basic idea in [78], which systematically yields compact 1-bit CNNs. We represent the kernel weights of the $l$-th layer as a tensor $K^l \in \mathbb{R}^{C^l_o \times C^l_i \times H^l \times W^l}$, where $C^l_o$ and $C^l_i$ denote the numbers of output and input channels, respectively, and $H^l$ and $W^l$ are the height and width of the kernels, respectively. For clarity, we define
\[
K^l = [K^l_1, K^l_2, \ldots, K^l_{C^l_o}], \tag{3.104}
\]
where $K^l_i$, $i = 1, 2, \ldots, C^l_o$, is a 3-dimensional filter in $\mathbb{R}^{C^l_i \times H^l \times W^l}$. For simplicity, $l$ is omitted in the remainder of this section. To prune 1-bit CNNs, we assimilate similar filters into a single one through a controlled learning process. To do this, we first divide the filters of $K$ into groups using the K-means algorithm and then replace the filters of each group by their average during optimization. This process assumes that the $K_i$'s in the same group follow the same Gaussian distribution during training. The pruning problem then becomes how to find the average $\bar{K}$ that replaces all the $K_i$'s following that distribution, which leads to a problem similar to Eq. 3.99. It should be noted that learning under a Gaussian distribution constraint is widely considered, e.g., in [82].
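The grouping step can be illustrated with a short sketch. The following NumPy/scikit-learn snippet is only an illustration, not the implementation used in the experiments: it assumes the layer's latent full-precision kernels are available as a $(C_o, C_i, H, W)$ array, flattens each filter, clusters the filters with K-means, and replaces every filter in a group by the group mean $\bar{K}$; the helper names (`group_filters`, `replace_by_group_means`) and the number of groups are hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_filters(weights, num_groups):
    """Cluster the C_o filters of one layer with K-means.

    `weights` is assumed to be a (C_o, C_i, H, W) array of latent
    full-precision kernels; each filter K_i is flattened before clustering.
    """
    c_o = weights.shape[0]
    flat = weights.reshape(c_o, -1)                # one row per filter K_i
    labels = KMeans(n_clusters=num_groups, n_init=10).fit_predict(flat)
    return labels

def replace_by_group_means(weights, labels):
    """Replace every filter by the mean (the shared center) of its group."""
    pruned = weights.copy()
    for g in np.unique(labels):
        idx = np.where(labels == g)[0]
        pruned[idx] = weights[idx].mean(axis=0)    # broadcast group mean to members
    return pruned

# Toy usage: 16 filters of shape (8, 3, 3), grouped into 4 clusters.
w = np.random.randn(16, 8, 3, 3).astype(np.float32)
labels = group_filters(w, num_groups=4)
w_shared = replace_by_group_means(w, labels)
```

After training, filters sharing one center can be collapsed into a single filter, which is the sense in which the grouping prunes the layer.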
Accordingly, Bayesian learning is used to prune 1-bit CNNs. We denote by $\epsilon$ the difference between a filter and its mean, i.e., $\epsilon = K - \bar{K}$, and assume for simplicity that it follows a Gaussian distribution. To calculate $\bar{K}$, we minimize $\epsilon$ based on MAP in our Bayesian framework, and we have
\[
\bar{K} = \arg\max_{\bar{K}} \; p(\bar{K}|\epsilon) = \arg\max_{\bar{K}} \; p(\epsilon|\bar{K})\,p(\bar{K}), \tag{3.105}
\]
\[
p(\epsilon|\bar{K}) \propto \exp\!\Bigl(-\frac{1}{2\nu}\|\epsilon\|_2^2\Bigr) \propto \exp\!\Bigl(-\frac{1}{2\nu}\|K - \bar{K}\|_2^2\Bigr), \tag{3.106}
\]
and $p(\bar{K})$ is similar to Eq. 3.101 but with one mode. Taking the negative logarithm of the posterior and dropping constant terms, we have
\[
\min_{\bar{K}} \; \|K - \bar{K}\|_2^2 + \nu\,(K - \bar{K})^T \Psi^{-1} (K - \bar{K}) + \nu \log\bigl(\det(\Psi)\bigr), \tag{3.107}
\]
which is called the Bayesian pruning loss. In summary, our Bayesian pruning addresses the problem in a general way: similar kernels are assumed to follow a Gaussian distribution and are finally represented by their centers for pruning. From this viewpoint, we obtain a pruning method that is better suited to binary neural networks than existing ones. Moreover, we take the latent distributions of kernel weights, features, and filters into consideration within the same framework and introduce Bayesian losses and Bayesian pruning to improve the capacity of 1-bit CNNs. Comparative experimental results on model pruning also demonstrate the superiority of our BONNs [287] over existing pruning methods.
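For concreteness, the pruning loss in Eq. 3.107 can be written as a differentiable penalty added to training. The PyTorch snippet below is a minimal sketch under added assumptions: $\Psi$ is restricted to a diagonal covariance parameterized by its log-diagonal, each filter is flattened to a vector, the group centers are held fixed within the step, and `nu`, the helper name `bayesian_pruning_loss`, and the toy shapes are illustrative rather than taken from the original implementation.

```python
import torch

def bayesian_pruning_loss(filters, centers, log_psi_diag, nu=1e-4):
    """Sketch of Eq. (3.107) with a diagonal Psi (an assumption made here).

    filters:      (C_o, D) flattened filters K_i of one layer
    centers:      (C_o, D) the group mean assigned to each filter
    log_psi_diag: (D,) log of the diagonal entries of Psi (learned jointly)
    nu:           weighting hyper-parameter (illustrative value)
    """
    diff = filters - centers                        # K - K_bar
    recon = (diff ** 2).sum()                       # ||K - K_bar||_2^2
    psi_inv = torch.exp(-log_psi_diag)              # Psi^{-1} for diagonal Psi
    mahala = ((diff ** 2) * psi_inv).sum()          # (K - K_bar)^T Psi^{-1} (K - K_bar)
    log_det = log_psi_diag.sum()                    # log det(Psi)
    return recon + nu * mahala + nu * log_det

# Toy usage: 16 filters of dimension 72 (= 8*3*3) shared among 4 groups.
filters = torch.randn(16, 72, requires_grad=True)
labels = torch.arange(16) % 4                       # fixed toy group assignment
group_means = torch.stack([filters[labels == g].mean(0) for g in range(4)])
centers = group_means[labels].detach()              # treat centers as fixed targets here
log_psi = torch.zeros(72, requires_grad=True)
loss = bayesian_pruning_loss(filters, centers, log_psi)
loss.backward()
```

In practice this term would be weighted against the task loss, and the centers would be refreshed as the grouping is updated; the detached centers above are a simplification for the sketch.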
3.7.4 BONNs
We employ the three Bayesian losses to optimize 1-bit CNNs, which form our Bayesian
Optimized 1-bit CNNs (BONNs). To do this, we reformulate the first two Bayesian losses